Read in the
gapminder_clean.csvdata as atibbleusingread_csv.
There are a total of 19 variables in 2607 rows in the dataset. Most of the variables are continuous, while only Country Name, Year and continent are discrete (All discrete variables will be treated as factors in the remaining of this analysis).
Empty strings were converted to NA values. It can be observed that there is an important number of NA values in the dataset across almost all variables. NA values are later filtered for the purpose of specific analyses.
There is a total of 263 countries across 10 specific years spanning from 1962 to 2007 in increments of 5 (e.g. 1962, 1967, 1972, etc), including 6 continents.
df <- as_tibble(read.csv("gapminder_clean.csv",
check.names = FALSE,
stringsAsFactors = TRUE
)[, -1]) %>%
mutate_all(~ na_if(., "")) %>%
mutate(Year = as.factor(Year))
dim(df)
## [1] 2607 19
data.frame(
`Unique values` = sapply(lapply(df, unique), length),
`NA count` = colSums(is.na(df)), Type = sapply(df, class),
check.names = FALSE
)
## Unique values NA count
## Country Name 263 0
## Year 10 0
## Agriculture, value added (% of GDP) 1395 1179
## CO2 emissions (metric tons per capita) 2174 414
## Domestic credit provided by financial sector (% of GDP) 1692 864
## Electric power consumption (kWh per capita) 1338 1238
## Energy use (kg of oil equivalent per capita) 1380 1197
## Exports of goods and services (% of GDP) 1771 798
## Fertility rate, total (births per woman) 1939 184
## GDP growth (annual %) 1880 691
## Imports of goods and services (% of GDP) 1771 798
## Industry, value added (% of GDP) 1388 1189
## Inflation, GDP deflator (annual %) 1671 706
## Life expectancy at birth, total (years) 2381 189
## Population density (people per sq. km of land area) 2537 49
## Services, etc., value added (% of GDP) 1388 1186
## pop 1285 1323
## continent 6 1323
## gdpPercap 1285 1323
## Type
## Country Name factor
## Year factor
## Agriculture, value added (% of GDP) numeric
## CO2 emissions (metric tons per capita) numeric
## Domestic credit provided by financial sector (% of GDP) numeric
## Electric power consumption (kWh per capita) numeric
## Energy use (kg of oil equivalent per capita) numeric
## Exports of goods and services (% of GDP) numeric
## Fertility rate, total (births per woman) numeric
## GDP growth (annual %) numeric
## Imports of goods and services (% of GDP) numeric
## Industry, value added (% of GDP) numeric
## Inflation, GDP deflator (annual %) numeric
## Life expectancy at birth, total (years) numeric
## Population density (people per sq. km of land area) numeric
## Services, etc., value added (% of GDP) numeric
## pop numeric
## continent factor
## gdpPercap numeric
Most continuous variables show a somewhat exponential distribution. The exceptions being: Fertility rate, total (births per woman) (somewhat uniform), Life expectancy at birth, total (years) (somewhat normal with a skew), and Services, etc., value added (% of GDP) (normal).
df %>%
dplyr::select(where(is.numeric)) %>%
gather() %>%
filter(!is.na(value)) %>%
ggplot(aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = "free")
Filter the data to include only rows where
Yearis1962and then make a scatter plot comparing'CO2 emissions (metric tons per capita)'andgdpPercapfor the filtered data.
After filtering for the desired year, there is corresponding data for 108 countries (removing NA values). gdpPercap ranges from 355 to 95,458, and showing a fairly linear relationship with CO2 emissions (metric tons per capita) which ranges from 0.01 to 42.64.
df_1962 <- df %>%
filter(Year == 1962) %>%
filter(!is.na(gdpPercap) &
!is.na(`CO2 emissions (metric tons per capita)`)) %>%
dplyr::select(
`Country Name`, Year, gdpPercap,
`CO2 emissions (metric tons per capita)`
)
summary(df_1962)
## Country Name Year gdpPercap
## Afghanistan: 1 1962 :108 Min. : 355.2
## Albania : 1 1967 : 0 1st Qu.: 1064.7
## Algeria : 1 1972 : 0 Median : 2510.7
## Angola : 1 1977 : 0 Mean : 5173.3
## Argentina : 1 1982 : 0 3rd Qu.: 6157.1
## Australia : 1 1987 : 0 Max. :95458.1
## (Other) :102 (Other): 0
## CO2 emissions (metric tons per capita)
## Min. : 0.00848
## 1st Qu.: 0.15407
## Median : 0.37623
## Mean : 2.35303
## 3rd Qu.: 2.61142
## Max. :42.63712
##
p <- df_1962 %>%
ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
geom_point(aes(text = `Country Name`)) +
labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
theme_bw() +
scale_y_log10() +
scale_x_log10() +
geom_smooth(method = lm)
ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
On the filtered data, calculate the correlation of
'CO2 emissions (metric tons per capita)'andgdpPercap. What is the correlation and associated p value?
The correlation between said variables is 0.9261 with a p-value < 2.2e-16 (highly significant), meaning the data is well correlated.
cor.test(df_1962$gdpPercap, df_1962$`CO2 emissions (metric tons per capita)`,
use = "complete.obs"
)
##
## Pearson's product-moment correlation
##
## data: df_1962$gdpPercap and df_1962$`CO2 emissions (metric tons per capita)`
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
On the unfiltered data, answer “In what year is the correlation between
'CO2 emissions (metric tons per capita)'andgdpPercapthe strongest?” Filter the dataset to that year for the next step…
Correlation between 'CO2 emissions (metric tons per capita)' and gdpPercap is the strongest on 1967.
df %>%
group_by(Year) %>%
summarize(Correlation = cor(gdpPercap,
`CO2 emissions (metric tons per capita)`,
use = "complete.obs"
)) %>%
arrange(desc(Correlation))
## # A tibble: 10 x 2
## Year Correlation
## <fct> <dbl>
## 1 1967 0.939
## 2 1962 0.926
## 3 1972 0.843
## 4 1982 0.817
## 5 1987 0.810
## 6 1992 0.809
## 7 1997 0.808
## 8 2002 0.801
## 9 1977 0.793
## 10 2007 0.720
Using
plotly, create an interactive scatter plot comparing'CO2 emissions (metric tons per capita)'andgdpPercap, where the point size is determined bypop(population) and the color is determined by thecontinent. You can easily convert anyggplotplot to aplotlyplot using theggplotly()command.
Corresponding data exist for 113 countries for year 1967. The relationship between 'CO2 emissions (metric tons per capita)' and gdpPercap remains highly linear.
df_1967 <- df %>%
filter(Year == 1967) %>%
filter(!is.na(gdpPercap) &
!is.na(`CO2 emissions (metric tons per capita)`)) %>%
dplyr::select(
`Country Name`, Year, gdpPercap,
`CO2 emissions (metric tons per capita)`, continent, pop
)
summary(df_1967)
## Country Name Year gdpPercap
## Afghanistan: 1 1967 :113 Min. : 349
## Albania : 1 1962 : 0 1st Qu.: 1136
## Algeria : 1 1972 : 0 Median : 2742
## Angola : 1 1977 : 0 Mean : 5768
## Argentina : 1 1982 : 0 3rd Qu.: 7656
## Australia : 1 1987 : 0 Max. :80895
## (Other) :107 (Other): 0
## CO2 emissions (metric tons per capita) continent pop
## Min. : 0.0118 : 0 Min. : 70787
## 1st Qu.: 0.1813 Africa :43 1st Qu.: 2427334
## Median : 0.5749 Americas:23 Median : 5212416
## Mean : 2.7235 Asia :23 Mean : 25359863
## 3rd Qu.: 4.4315 Europe :22 3rd Qu.: 12607312
## Max. :43.4283 Oceania : 2 Max. :754550000
##
p <- df_1967 %>%
ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
geom_point(aes(text = `Country Name`, col = continent, size = pop)) +
labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
theme_bw() +
scale_y_log10() +
scale_x_log10() +
geom_smooth(method = lm)
ggplotly(p)
## `geom_smooth()` using formula 'y ~ x'
What is the relationship between
continentand'Energy use (kg of oil equivalent per capita)'? (stats test needed)
There is a total of 848 row without NA values. Observation count is fairly balanced between continents with the exception of Oceania, where there is data for only 20 rows.
df_continent <- df %>%
filter(!is.na(continent) &
!is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
dplyr::select(
continent,
`Energy use (kg of oil equivalent per capita)`, pop,
`Country Name`
)
dim(df_continent)
## [1] 848 4
summary(df_continent)
## continent Energy use (kg of oil equivalent per capita)
## : 0 Min. : 9.715
## Africa :199 1st Qu.: 484.129
## Americas:188 Median : 995.564
## Asia :185 Mean : 1992.607
## Europe :256 3rd Qu.: 2895.659
## Oceania : 20 Max. :14746.031
##
## pop Country Name
## Min. :1.821e+05 Australia: 10
## 1st Qu.:4.437e+06 Austria : 10
## Median :1.015e+07 Belgium : 10
## Mean :4.389e+07 Canada : 10
## 3rd Qu.:3.155e+07 Denmark : 10
## Max. :1.319e+09 Finland : 10
## (Other) :788
A linear regression for this data set shows that there are statistically significant differences (p-value: < 2.2e-16) across continents in energy use. Being Oceania the continent with the highest consumption, and the lowest being Africa (which is the base, i.e. just intercept value). The regression coefficients show that, for example Asia consumes, in average, 1169 kg of oil equivalent yearly per person more than Africa, and 164 kg (1169-1005) more than the Americas.
Looking at the diagnostic plots, the Residuals vs Fitted shows a somewhat horizontal line (with a small valley), hinting that the relationship between the variables is fairly linear but variables could use transformation (explored below), this is supported by the observed separation (points vs dotted line) at the right side of the Normal Q-Q plot.
The Scale-Location plot hints heteroscedasticity (non constant variance), which is not surprising because all the others variables in the data set are left out, including a time variable (Year), which could hint that the mean of 'Energy use (kg of oil equivalent per capita)' does not remain constant over time (which is to be expected).
Finally Residuals vs Leverage plot does not show points beyond Cook’s distance, hinting that outliers are not to be expected (although there is a set of observations with high leverage, which could be further explored).
Overall, adjusted R-squared is 0.1924 meaning that ~19% of observed variance in 'Energy use (kg of oil equivalent per capita)' is explained by continent.
model_base <- lm(
`Energy use (kg of oil equivalent per capita)` ~ continent,
df_continent
)
summary(model_base)
##
## Call:
## lm(formula = `Energy use (kg of oil equivalent per capita)` ~
## continent, data = df_continent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2796.0 -1107.5 -349.1 276.8 12904.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.5 137.2 5.090 4.42e-07 ***
## continentAmericas 1005.1 196.9 5.105 4.10e-07 ***
## continentAsia 1168.8 197.7 5.911 4.93e-09 ***
## continentEurope 2447.5 183.0 13.377 < 2e-16 ***
## continentOceania 3281.8 454.1 7.227 1.11e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1936 on 843 degrees of freedom
## Multiple R-squared: 0.1963, Adjusted R-squared: 0.1924
## F-statistic: 51.46 on 4 and 843 DF, p-value: < 2.2e-16
plot(model_base)
### Log transforming variables
In this case, the only variable that could be transformed is Energy use (kg of oil equivalent per capita), as continent is discrete.
Log transforming energy use improved the fit, yielding an adjusted R-squared of 0.3553 (a significant increase from 0.1924).
The better fit is also evidenced by improvement in the diagnostic plots. As can be seen in Residuals vs Fitted showing a more horizontal line with a smaller peak than previously seen, and Normal Q-Q plot showing that the right tail of points follow the dotted line closer.
Scale-Location and Residuals vs Leverage remain similar to what was previously seen.
Overall, this model shows benefits from log transforming Energy use (kg of oil equivalent per capita) as more variance can be explained by continent.
While interpreting the coefficients, as shown below, for the log transformed model it is important to take into account that the unit is in log10 scale, meaning, for example in the case of Oceania, that the log10 of Energy use (kg of oil equivalent per capita) increases by 0.85552 from its base (Africa/the intercept) when the country is in Oceania. As an example, an estimation of the energy use when the country is in Oceania would be: \(10^{(2.72558+0.85552)}\)
model_log <- lm(log10(`Energy use (kg of oil equivalent per capita)`) ~
continent, df_continent)
summary(model_log)
##
## Call:
## lm(formula = log10(`Energy use (kg of oil equivalent per capita)`) ~
## continent, data = df_continent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73812 -0.21810 -0.03864 0.17894 1.16710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72558 0.02627 103.746 < 2e-16 ***
## continentAmericas 0.27191 0.03769 7.214 1.21e-12 ***
## continentAsia 0.23187 0.03785 6.126 1.38e-09 ***
## continentEurope 0.70072 0.03502 20.007 < 2e-16 ***
## continentOceania 0.85552 0.08694 9.841 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3706 on 843 degrees of freedom
## Multiple R-squared: 0.3583, Adjusted R-squared: 0.3553
## F-statistic: 117.7 on 4 and 843 DF, p-value: < 2.2e-16
plot(model_log)
Non linear relationships were left out of this analysis.
Box plots are shown below. Differences in energy use between continents are evidenced visually in the boxes, showing different Y-axis locations across continents. Africa is the continent with the smallest energy use, followed by Asia and the Americas which are similar among each other, while Europe is most similar with Oceania (albeit Oceania shows less dispersion, driven mostly by its limited number of observations).
It is to be noted that the countries with the max consumption of energy is seen in America and Europe, while the minimum is seen in Africa.
p <- df_continent %>%
ggplot(aes(
x = continent,
y = `Energy use (kg of oil equivalent per capita)`
)) +
geom_boxplot(aes(col = continent)) +
labs(title = "Energy use vs continent", x = "GPD Per Capita") +
scale_y_log10()
ggplotly(p)
Not surprisingly, ANOVA confirms that there are statistical differences between continents for the variable in question, showing a very significant p-value of 2e-16 (the interpretation of p-value is explained in more detail in the next analysis).
df_continent <- df %>%
filter(!is.na(continent) &
!is.na(`Energy use (kg of oil equivalent per capita)`))
model_log <- aov(log10(`Energy use (kg of oil equivalent per capita)`) ~
continent, df_continent)
summary(model_log)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 64.66 16.165 117.7 <2e-16 ***
## Residuals 843 115.79 0.137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_log
## Call:
## aov(formula = log10(`Energy use (kg of oil equivalent per capita)`) ~
## continent, data = df_continent)
##
## Terms:
## continent Residuals
## Sum of Squares 64.65929 115.78600
## Deg. of Freedom 4 843
##
## Residual standard error: 0.3706075
## Estimated effects may be unbalanced
Is there a significant difference between Europe and Asia with respect to
'Imports of goods and services (% of GDP)'in the years after 1990? (stats test needed)
Complete corresponding data exists for 212 rows after 1990. An ANOVA test does not show significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' with a p-value of 0.158 for the F statistic. Meaning there is a 15.8% chance that the observed variation is due to random chance, which is usually not enough to reject the null hypothesis (that the samples come from the same population). P-value should be at least lower than 0.05 to reject the null hypothesis with a 95% confidence interval.
The box plot below visually confirms the aforementioned. Both samples have a similar mean and somewhat similar interquartile range, although dispersion is higher in Asia.
df_1990 <- df %>%
filter(as.numeric(as.character(Year)) > 1990 &
continent %in% c("Europe", "Asia") &
!is.na(`Imports of goods and services (% of GDP)`)) %>%
dplyr::select(continent, `Imports of goods and services (% of GDP)`, Year)
summary(df_1990)
## continent Imports of goods and services (% of GDP) Year
## : 0 Min. : 0.07951 2002 :56
## Africa : 0 1st Qu.: 27.71081 2007 :56
## Americas: 0 Median : 38.77832 1997 :53
## Asia : 98 Mean : 44.12648 1992 :47
## Europe :114 3rd Qu.: 56.12958 1962 : 0
## Oceania : 0 Max. :183.91553 1967 : 0
## (Other): 0
anova <- aov(`Imports of goods and services (% of GDP)` ~ continent, df_1990)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 1 1347 1347.2 2.012 0.158
## Residuals 210 140594 669.5
p <- df_1990 %>%
ggplot(aes(
x = continent,
y = `Imports of goods and services (% of GDP)`, col = continent
)) +
geom_boxplot()
ggplotly(p)
What is the country (or countries) that has the highest
'Population density (people per sq. km of land area)'across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)
Average population density varies greatly from 0.139 to 14,732. The top 2 countries with the highest 'Population density (people per sq. km of land area)' across all years are Macao SAR, China and Monaco.
df_pop <- df %>%
filter(!is.na(`Population density (people per sq. km of land area)`)) %>%
group_by(`Country Name`) %>%
summarize(
AvgPopDensity =
mean(`Population density (people per sq. km of land area)`)
) %>%
arrange(desc(AvgPopDensity))
summary(df_pop)
## Country Name AvgPopDensity
## Afghanistan : 1 Min. : 0.139
## Albania : 1 1st Qu.: 20.014
## Algeria : 1 Median : 47.132
## American Samoa: 1 Mean : 261.664
## Andorra : 1 3rd Qu.: 121.271
## Angola : 1 Max. :14732.035
## (Other) :256
df_pop_top <- df_pop %>%
top_n(5, wt = AvgPopDensity)
p <- df %>%
filter(`Country Name` %in% df_pop_top$`Country Name`) %>%
ggplot(aes(
y = `Population density (people per sq. km of land area)`,
x = reorder(
`Country Name`,
-`Population density (people per sq. km of land area)`
),
col = `Country Name`
)) +
geom_boxplot() +
labs(title = "Top 5 countries by pop. density", x = "Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(p)
What country (or countries) has shown the greatest increase in
'Life expectancy at birth, total (years)'since 1962?
In average, all countries increased life expectancy by ~30% from 1962 to 2007 (shown as a blue dotted line in the plot), however the top countries in this ranking yielded a ~100% (or 2x) increase. Continent data was incomplete so that was left out of the analysis.
The plot below shows the distribution of life expectancy increase (Increase_LE) across countries. Something to notice is that after the top 11th country (China) the rate of decrease starts slowing down (showing an elbow in the plot). For this reason, the top 11 countries were selected as the answer to the question presented, and plotted as a bar chart at the end.
df_life_incr <- df %>%
group_by(`Country Name`) %>%
filter(!is.na(`Life expectancy at birth, total (years)`)) %>%
mutate(
`Life expectancy 1962` =
`Life expectancy at birth, total (years)`[match(1962, Year)],
.keep = "all"
) %>%
filter(!is.na(`Life expectancy 1962`)) %>%
filter(Year == 2007) %>%
mutate(
Increase_LE =
`Life expectancy at birth, total (years)` /
`Life expectancy 1962`
) %>%
dplyr::select(
continent, `Country Name`, `Life expectancy 1962`,
`Life expectancy at birth, total (years)`, Increase_LE
) %>%
arrange(desc(Increase_LE)) %>%
ungroup()
summary(df_life_incr)
## continent Country Name Life expectancy 1962
## : 0 Afghanistan : 1 Min. :28.55
## Africa : 47 Albania : 1 1st Qu.:44.82
## Americas: 24 Algeria : 1 Median :54.29
## Asia : 25 Angola : 1 Mean :54.26
## Europe : 29 Antigua and Barbuda: 1 3rd Qu.:64.73
## Oceania : 2 Arab World : 1 Max. :73.72
## NA's :109 (Other) :230
## Life expectancy at birth, total (years) Increase_LE
## Min. :44.18 Min. :0.8451
## 1st Qu.:62.03 1st Qu.:1.1477
## Median :71.34 Median :1.2569
## Mean :68.74 Mean :1.2995
## 3rd Qu.:75.15 3rd Qu.:1.4345
## Max. :82.51 Max. :2.0032
##
p <- df_life_incr %>%
ggplot(aes(y = Increase_LE, x = reorder(`Country Name`, -Increase_LE))) +
geom_point() +
labs(title = "Countries by life expectancy increase since 1962") +
geom_hline(
yintercept = mean(df_life_incr$Increase_LE),
linetype = "dotted", color = "blue"
) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
ggplotly(p)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
p <- df_life_incr %>%
top_n(11, wt = Increase_LE) %>%
ggplot(aes(
y = Increase_LE, x = reorder(`Country Name`, -Increase_LE),
fill = `Country Name`
)) +
geom_bar(stat = "identity") +
labs(
title =
"Top countries by life expectancy increase since 1962",
x = "Country"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(p)